This article continues from 10 Essential Python Data Cleaning Techniques for Web Scraping, focusing on practical data cleaning methods that modern scraping projects rely on daily.
Most real-world scraping tasks involve APIs, JavaScript-rendered data, and unstructured text. Therefore, mastering Python data cleaning for web scraping is essential for building stable and scalable crawlers.
3. JSON Data Cleaning
Today, most websites expose data through APIs, and JSON has become the dominant response format. As a result, Python developers must handle JSON efficiently.
Example: Star Wars API (SWAPI)
Request the endpoint and parse the response:
import requests

url = "https://swapi.dev/api/people/"
# verify=False skips TLS certificate verification; use it only when the
# site's certificate is invalid, since it weakens security
response = requests.get(url, verify=False)

json_data = response.json()
print(json_data["results"])  # List of character records
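SWAPI paginates its results: each response also carries a next URL, which is null on the last page. A minimal sketch for walking every page, assuming those standard SWAPI fields:
import requests

def fetch_all_people(url="https://swapi.dev/api/people/"):
    """Follow the API's pagination until 'next' is exhausted."""
    people = []
    while url:
        payload = requests.get(url, verify=False).json()
        people.extend(payload["results"])
        url = payload["next"]  # None on the last page
    return people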
Handling Non-English Characters
When JSON contains non-English text, such as Chinese characters, explicitly set the response encoding before parsing:
response.encoding = "utf-8"
json_data = response.json()
This step prevents garbled text and ensures accurate downstream processing.
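If the encoding is unknown up front, requests can infer it from the response bytes via its apparent_encoding attribute; a small hedged fallback:
# requests defaults to ISO-8859-1 when the server omits a charset,
# which garbles UTF-8 text; fall back to the detected encoding
if not response.encoding or response.encoding.lower() == "iso-8859-1":
    response.encoding = response.apparent_encoding
json_data = response.json()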
4. Storing JSON Data in MongoDB (NoSQL)
When JSON structures become deeply nested, traditional SQL databases introduce unnecessary complexity. In contrast, MongoDB handles nested documents naturally, making it a strong choice for Python data cleaning for web scraping.
Installation
pip install pymongo
Insert JSON Data into MongoDB
import pymongo
from pymongo.errors import BulkWriteError

# Placeholder credentials -- replace with your own connection details
user, password, host, port = "user", "password", "localhost", 27017

client = pymongo.MongoClient(f"mongodb://{user}:{password}@{host}:{port}")
db = client["db_spider"]
collection = db["wars_star"]

# A unique index on "name" prevents duplicate insertion
collection.create_index("name", unique=True)

try:
    # ordered=False keeps inserting even after a duplicate is rejected
    collection.insert_many(json_data["results"], ordered=False)
except BulkWriteError:
    pass  # Duplicates were skipped; all other documents were inserted
Query Examples
The queries below use MongoDB shell syntax. Find characters whose names contain “Le”:
db.getCollection("wars_star").find({ name: /Le/ })
Find characters appearing in a specific film:
db.getCollection("wars_star").find({
films: { $in: ["https://swapi.dev/api/films/1/"] }
})
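The same queries in Python with pymongo, reusing the collection object created above:
# Names containing "Le" (equivalent to the shell regex /Le/)
for doc in collection.find({"name": {"$regex": "Le"}}):
    print(doc["name"])

# Characters appearing in a specific film
film_url = "https://swapi.dev/api/films/1/"
for doc in collection.find({"films": {"$in": [film_url]}}):
    print(doc["name"])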
Because MongoDB supports flexible schemas, it simplifies storage and querying of API responses with variable fields.
5. Handling JavaScript Object Data (JSONP)
Some websites return data wrapped inside JavaScript objects rather than pure JSON. Financial websites often use this pattern.
Example: Parsing JavaScript Object Data
# On Python 3, use the maintained fork: pip install demjson3,
# then import demjson3 as demjson
import demjson

# The body looks like "var someVar = {...};" -- slice out the object
# between the "=" (plus a space) and the trailing ";"
js_data = response.text[
    response.text.find("=") + 2 : response.text.rfind(";")
]

# demjson tolerates JavaScript syntax (unquoted keys, single quotes)
# that the standard json module rejects
raw_data = demjson.decode(js_data)

# Each entry in "datas" is one comma-separated record string
rank_list = [item.split(",") for item in raw_data["datas"]]
This approach allows you to convert JavaScript-style data into structured Python objects without browser automation.
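When the wrapped payload is itself valid JSON inside a callback (classic JSONP), the standard library is enough and no extra dependency is needed. A minimal sketch with a made-up sample response:
import json
import re

# Hypothetical JSONP response for illustration
jsonp = 'handleData({"datas": ["Luke,77", "Leia,60"]});'

# Keep everything between the first "(" and the last ")"
match = re.search(r"\((.*)\)\s*;?\s*$", jsonp, re.S)
data = json.loads(match.group(1))
print(data["datas"])  # ['Luke,77', 'Leia,60']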
6. Regular Expressions: The Universal Tool
Even with structured APIs, some data only appears inside raw HTML or text. In such cases, regular expressions provide a reliable fallback.
Single Match with re.search
import re

html = '<div class="q-text">6,526 followers</div>'
match = re.search(r">(.*?) followers<", html)
if match:  # re.search returns None when nothing matches
    followers = int(match.group(1).replace(",", ""))
    print(followers)  # 6526
Multiple Matches with re.findall
text = "Phone numbers: 18767543212 and 19767443218"
phones = re.findall(r"\d{11}", text)
print(phones)
# ['18767543212', '19767443218']
Regular expressions remain indispensable when APIs are unavailable or page structures change frequently.
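For recurring extraction jobs, pre-compiling patterns and using named groups keeps the cleaning code readable. A short sketch on made-up sample text:
import re

# Match a currency symbol followed by a formatted amount
PRICE_RE = re.compile(r"(?P<currency>[$€£])(?P<amount>[\d,]+(?:\.\d+)?)")

text = "Sale: $1,299.99 today, was $1,499.00"
for m in PRICE_RE.finditer(text):
    amount = float(m.group("amount").replace(",", ""))
    print(m.group("currency"), amount)
# $ 1299.99
# $ 1499.0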
When to Use Each Technique
| Scenario | Recommended Method |
|---|---|
| API responses | JSON parsing |
| Nested or flexible schemas | MongoDB |
| JavaScript-returned objects | JSONP + demjson |
| Unstructured HTML/text | Regular expressions |
In practice, effective Python data cleaning for web scraping combines multiple techniques rather than relying on a single solution.
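As a hedged illustration of combining techniques, the sketch below prefers structured JSON and falls back to a regular expression when the body is not valid JSON (the URL and the 11-digit pattern are placeholders):
import re

import requests

def extract_payload(url):
    """Prefer a structured JSON response; fall back to regex on raw text."""
    response = requests.get(url)
    try:
        return response.json()  # Clean API response
    except ValueError:  # Body was not JSON
        # Fallback: pull 11-digit phone numbers out of unstructured text
        return re.findall(r"\d{11}", response.text)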
Conclusion
In this article, you learned how to clean and process scraped data using JSON parsing, MongoDB storage, JavaScript object handling, and regular expressions. These techniques cover the majority of real-world scraping scenarios and integrate smoothly with larger crawling pipelines.
For more on extracting raw HTML data before cleaning, see:
Crawling HTML Pages: Python Web Scraping Tutorial
In the next installment, we will explore more advanced data cleaning strategies that further improve crawler efficiency and data quality.